Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

نویسندگان

  • Rion Snow
  • Brendan T. O'Connor
  • Daniel Jurafsky
  • Andrew Y. Ng
چکیده

Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon’s Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out in this method at a fraction of the usual expense.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Frame Semantics Annotation Made Easy with DBpedia

Crowdsourcing techniques applied to natural language processing have recently experienced a steady growth and represent a cheap and fast, albeit valid, solution to create benchmarks and training data. Nevertheless, some particularly complex tasks such as semantic role annotation have been rarely conducted in a crowdsourcing environment, due to their intrinsic difficulty. In this paper, we prese...

متن کامل

Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of en...

متن کامل

Designing an Expert System for Internet Connection Problems Troubleshooting for wired network users

Man, is living in an era that the knowledge is estimated to be doubled in a relatively short time. The fast rate of technology's growth in the "Century of information", is caused by fast growth of communication technologies like the internet which has become one of the best tools for a quick, cheap, effective and vastly supported communication. For an efficient and effective usage of tools and ...

متن کامل

Designing an Expert System for Internet Connection Problems Troubleshooting for wired network users

Man, is living in an era that the knowledge is estimated to be doubled in a relatively short time. The fast rate of technology's growth in the "Century of information", is caused by fast growth of communication technologies like the internet which has become one of the best tools for a quick, cheap, effective and vastly supported communication. For an efficient and effective usage of tools and ...

متن کامل

Preliminary Experience with Amazon’s Mechanical Turk for Annotating Medical Named Entities

Amazon’s Mechanical Turk (MTurk) service is becoming increasingly popular in Natural Language Processing (NLP) research. In this paper, we report our findings in using MTurk to annotate medical text extracted from clinical trial descriptions with three entity types: medical condition, medication, and laboratory test. We compared MTurk annotations with a gold standard manually created by a domai...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008